Skip to content

fix: add node health check to incr/decr before mutate operation#186

Open
AustinWheel wants to merge 2 commits intomasterfrom
awheeler/fix/incr-decr-tolerance
Open

fix: add node health check to incr/decr before mutate operation#186
AustinWheel wants to merge 2 commits intomasterfrom
awheeler/fix/incr-decr-tolerance

Conversation

@AustinWheel
Copy link
Collaborator

incr/decr operations bypassed the ensureWriteQueueSize check that other
write paths (set, add, touch) use, causing them to block on unreachable
or stalled nodes for the full mutateOperationTimeout duration. A single
unhealthy node could cause all incr/decr calls to block, exhausting the
caller's thread pool.

This adds the same ensureWriteQueueSize gate to incr/decr. If the target
node is inactive or its write queue is full, the operation returns -1
immediately, consistent with the documented API contract. The existing
reconciliation logic in EVCacheImpl.incr()/decr() already handles -1 as
a zone failure and will sync the value from healthy zones on the next
successful operation.

  incr/decr operations bypassed the ensureWriteQueueSize check that other
  write paths (set, add, touch) use, causing them to block on unreachable
  or stalled nodes for the full mutateOperationTimeout duration. A single
  unhealthy node could cause all incr/decr calls to block, exhausting the
  caller's thread pool.

  This adds the same ensureWriteQueueSize gate to incr/decr. If the target
  node is inactive or its write queue is full, the operation returns -1
  immediately, consistent with the documented API contract. The existing
  reconciliation logic in EVCacheImpl.incr()/decr() already handles -1 as
  a zone failure and will sync the value from healthy zones on the next
  successful operation.
… write path

- EVCacheImpl.decr: fix reconciliation to pick the minimum non-(-1) value
  across nodes instead of the maximum. For decr, the most up-to-date node
  has the lowest value (most decremented), so the old max-pick logic would
  overwrite correctly decremented nodes with stale higher values.

- EVCacheClient.ensureWriteQueueSize: add isAvailable() check before
  entering the retry/sleep loop. This fast-fails writes to inactive nodes
  instead of blocking request threads through 3 retry iterations. Affects
  all write operations that go through ensureWriteQueueSize (incr, decr,
  set, delete, etc.).
@Sunjeet
Copy link
Collaborator

Sunjeet commented Mar 10, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Reviewed the following changes:

  • Node health check (ensureWriteQueueSize) added to incr/decr in EVCacheClient.java -- consistent with existing pattern used by all other write operations
  • Inactive node early-exit added to ensureWriteQueueSize -- correctly avoids retry loop for already-disconnected nodes
  • Decrement reconciliation fix in EVCacheImpl.java -- correctly selects minimum (most-decremented) value across zones instead of maximum

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants